TRANSFORMATIONS FOR MULTIVARIATE BINARY DATA

by 

P.  Bloomfield 

Summary 

The  interpretation  of  statistical  data  may  often  be  simplified  by  a 
preliminary  transformation.  In  the  context  of  contingency  tables,  one  way 
of  achieving  this  would  be  to  relabel  the  possible  outcomes,  or  in  other 
words to permute the cells of the table. For a $2^d$ table, certain
permutations  have  the  property  that  a  loglinear  model  for  the  cell 
probabilities  transforms  in  a  simple  way.  These  are,  in  a  sense,  linear 
transformations  of  the  original  variables. 

The  aim  of  making  such  a  transformation  is  to  fit  the  transformed 
data  by  a  simple  model,  such  as  a  low  order  hierarchical  model  or  one 
in which certain variables are independent of others. A $2^4$ table has
been  analysed  with  this  end  in  view.  All  the  models  were  fitted  to  the 
original  data,  and  to  do  this  a  computer  program  has  been  developed  which 
will fit nonhierarchical models by iterative scaling.

1. Introduction

It  is  often  found  that  the  analysis  of  statistical  data  may  be  simplified 
by  using  a  suitably  chosen  transformation.  The  most  common  examples  involve 
linear  transformations  of  continuous  variables.  However,  Goodman  (1971) 
gives  an  example  of  a  contingency  table  with  an  apparently  rather  complex 
structure,  which  may  be  simplified  by  describing  the  responses  in  terms  of 
different variables. In other words, the variables used to index the data
need  to  be  transformed.  Professor  D.  R.  Cox  has  also  mentioned  in  lectures 
the  need  for  study  of  such  transformations. 

The simplest type of contingency table is the $2^d$ table, indexed by
d  variables,  each  taking  just  two  values.  The  possible  transformations  of 
such  variables  are  discussed  in  Section  2,  and  the  subset  of  linear 
transformations  is  defined.  Linear  transformations  have  the  advantage 
that a factorial-type model for the probability distribution, such as
those  discussed  by  Bahadur  (1961)  and  Birch  (1963),  is  also  transformed  in 
a  simple  way. 

The  aim  of  making  a  transformation  is  to  simplify  the  structure  of 
the  data.  For  example,  one  might  look  for  transformed  variables  to  which 
a  simple  hierarchical  model  (Birch,  1963;  Bishop,  1969)  could  be  fitted. 

In Section 3 we examine a $2^4$ table extracted from the data of Ries and
Smith (1963), which has also been analysed by Cox and Lauh (1967) and
Goodman (1971). Two transformations are used to show the type of
simplification which might be achieved. In Section 4 we examine a problem
which arises in the use of the Deming-Stephan (Deming and Stephan, 1940)
algorithm to fit nonhierarchical models.

2. Transformations

Suppose that $X = (X_1, \ldots, X_d)$ is a d-variate random variable such
that each $X_a$ takes the values 0 or 1, $a = 1, \ldots, d$. The set of
possible values of $X$ is thus $I_d$, the set of vertices of the unit
d-dimensional hypercube. A typical vertex will be denoted by
$i = (i_1, \ldots, i_d)$, where each $i_a = 0$ or 1, $a = 1, \ldots, d$.
If we draw a random sample of $n$ X's from some distribution over $I_d$,
the collection of counts $n_i$ = no. of X's taking the value $i$,
$i \in I_d$, is a $2^d$ contingency table.

One type of transformation which has been used on such data is applied
to the cell counts $n_i$, $i \in I_d$. Thus one might make a
variance-stabilising transformation, or some transformation designed to
reveal additivity of structure. However, we are concerned in the present
paper with transformations not of cell counts but of the original random
variable $X$. In order to preserve the information present in $X$, we ask
that the transformation should be invertible, and hence the range of the
transformed variable must contain exactly $2^d$ points. It is simplest to
assume that this range is in fact $I_d$; thus a transformation of $X$ is
simply a permutation of $I_d$.

Some of these permutations are of course trivial. A re-ordering of
the components of $X$ will rarely be useful, and similarly a re-coding of
any component, that is replacing $X_a$ by $1 - X_a$. Thus there are
$d!\,2^d$ trivially distinct versions of any transformation. However, this
still leaves

$$m_d = \frac{(2^d)!}{d!\,2^d} = \frac{(2^d - 1)!}{d!} \tag{2.1}$$

non-trivial transformations (including the identity), a number which
increases alarmingly for modest values of d. At $d = 4$, for instance,
its value is around $5 \times 10^{10}$.


Clearly  these  transformations  differ  in  the  extent  to  which  they  change 
the  original  variables.  In  the  simplest  case  of  a  2x2  table,  however, 
there  is  essentially  only  one  type  of  transformation.  We  introduce  this 
with  an  example  due  to  D.  R.  Cox.  Consider  an  experiment  in  which  a 
couple  are  asked  their  voting  intentions.  Suppose  that  we  code  the 


responses as

$$X_1 = \begin{cases} 1 & \text{husband votes for Party D} \\ 0 & \text{husband votes for Party R,} \end{cases} \tag{2.2}$$

with $X_2$ carrying similar meaning for the wife. If political
considerations carried no weight in the choice of a marital partner, and if
there were no subsequent interaction, then $X_1$ and $X_2$ would be
independent, and would thus represent a simple and useful way of coding the
responses. However, we could also use the coding

$$X_1' = X_1, \qquad X_2' = \begin{cases} 0 & X_1 = X_2 \\ 1 & X_1 \ne X_2; \end{cases} \tag{2.3}$$

here $X_2'$ records whether the voting intentions were the same or different.
If it emerged that $X_1'$ and $X_2'$ were independent, then these would be the
natural variables with which to record the responses.

For this 2x2 case, there is only one other transformation, to
variables

$$X_1'' = X_2, \qquad X_2'' = X_2'. \tag{2.4}$$

Since $(2^2 - 1)!/2! = 3$, all other transformations may be obtained
trivially from these three sets of variables.

It is interesting to note that the variables $X_1'$, $X_2'$, $X_1''$ and
$X_2''$ may be written as linear transformations in residue arithmetic
modulo 2. For $X_2' = X_1 + X_2$ in this arithmetic, and this is the only new
variable used. When $d > 2$, certain transformations may still be written in
this way; for example, we could have had some third variable
$X_3 = X_3' = X_3''$.

However, not all transformations are linear when $d \ge 3$. The easiest
way to see this is by counting. There are $2^d - 1$ linear functions of
$X_1, \ldots, X_d$, corresponding to the inclusion or exclusion of each
variable, and omitting the null function in which all variables are
excluded. Thus the first transformed variable may be chosen in $2^d - 1$
possible ways. The second must be distinct from the first and may thus be
chosen in $2^d - 2$ ways. The linear space generated by these two contains
3 non-null functions; hence the third variable must be chosen from the
remaining $2^d - 4$. Continuing the argument, the total number of invertible
linear transformations is

$$(2^d - 1)(2^d - 2)(2^d - 4) \cdots (2^d - 2^{d-1}).$$

Each of these occurs in $d!$ trivially distinct forms; the possibility of
recoding any variable, that is interchanging 1 and 0, has been eliminated.
This leaves a total of

$$(2^d - 1)(2^d - 2) \cdots (2^d - 2^{d-1})/d!$$

essentially distinct transformations, including as before the identity. This
may be rewritten as

$$n_d = \frac{2^{d(d-1)/2}}{d!} \prod_{a=1}^{d} (2^a - 1), \tag{2.5}$$

a number which still increases rapidly as a function of d. However, since

$$n_d = \frac{(2^d - 1)\,2^{d-1}}{d}\, n_{d-1}, \tag{2.6}$$

which may be compared with

$$m_d = \frac{(2^d - 1)!}{d\,(2^{d-1} - 1)!}\, m_{d-1}, \tag{2.7}$$

it is clear that $n_d$ increases far more slowly than $m_d$. Thus the set of
linear transformations is an increasingly small subset of the set of all
transformations.
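The two counts can be tabulated directly. The following is a minimal sketch in Python; the function names `count_all` and `count_linear` are illustrative and do not appear in the paper. It evaluates the number of essentially distinct permutations, $(2^d - 1)!/d!$, and the number of essentially distinct invertible linear transformations modulo 2.

```python
from math import factorial

def count_all(d):
    """Essentially distinct permutations of the 2^d cells: (2^d - 1)! / d!."""
    return factorial(2 ** d - 1) // factorial(d)

def count_linear(d):
    """Essentially distinct invertible linear maps modulo 2:
    (2^d - 1)(2^d - 2)...(2^d - 2^(d-1)) / d!."""
    total = 1
    for a in range(d):
        total *= 2 ** d - 2 ** a
    return total // factorial(d)

for d in range(2, 6):
    print(d, count_all(d), count_linear(d))
```

For $d = 2$ both counts equal 3, in agreement with the 2x2 discussion; from $d = 3$ onwards the linear transformations form a rapidly shrinking fraction of the total.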


A comparison with normal theory suggests that in a first discussion of
transformations, we should restrict our attention to linear transformations.
There is, however, a more compelling reason for doing this, to discuss
which we need to consider the log-linear model (Birch, 1963) for the
probabilities in a $2^d$ table. We begin by going back to the $2^2$ example.
Let $p_i = \mathrm{pr}(X = i) = \mathrm{pr}(X_1 = i_1, X_2 = i_2)$,
$i_1, i_2 = 0, 1$. Assuming that $p_i > 0$ for each $i$, we let
$\mu_i = \log p_i$. Then the table of $\mu$'s may be decomposed as in a
factorial experiment, as the sum of different components. For our purposes,
this is most conveniently written

$$\mu_i = \sum_{j \in I_2} \lambda_j (-1)^{i^t j}, \qquad i \in I_2, \tag{2.8}$$


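or in full,

$$\begin{aligned}
\mu_{00} &= \lambda_{00} + \lambda_{01} + \lambda_{10} + \lambda_{11} \\
\mu_{01} &= \lambda_{00} - \lambda_{01} + \lambda_{10} - \lambda_{11} \\
\mu_{10} &= \lambda_{00} + \lambda_{01} - \lambda_{10} - \lambda_{11} \\
\mu_{11} &= \lambda_{00} - \lambda_{01} - \lambda_{10} + \lambda_{11}
\end{aligned} \tag{2.9}$$

The superscript t on a vector or matrix denotes transposition. Here
$\lambda_{11}$ is the "interaction" between $X_1$ and $X_2$. When it
vanishes, $X_1$ and $X_2$ are independent.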
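The decomposition is easily checked numerically. The sketch below, in Python, inverts the factorial expansion of the log probabilities by the usual formula $\lambda_j = 2^{-d} \sum_i \mu_i (-1)^{i^t j}$ and confirms that the interaction vanishes for an independent 2x2 table. The function name and both example tables are hypothetical and serve only as illustration.

```python
import math
from itertools import product

def loglinear_coefficients(p):
    """Return lambda_j = 2^{-d} * sum_i log(p_i) * (-1)^{i.j} for a 2^d table.

    p maps each cell i (a tuple of 0/1 values) to a positive probability.
    """
    cells = list(p)
    d = len(cells[0])
    lam = {}
    for j in product((0, 1), repeat=d):
        total = 0.0
        for i in cells:
            sign = (-1) ** sum(a * b for a, b in zip(i, j))
            total += sign * math.log(p[i])
        lam[j] = total / 2 ** d
    return lam

# Hypothetical independent table: pr(X1 = 0) = 0.3, pr(X2 = 0) = 0.6.
p1, p2 = (0.3, 0.7), (0.6, 0.4)
independent = {(a, b): p1[a] * p2[b] for a in (0, 1) for b in (0, 1)}
# Hypothetical dependent table with excess mass on the diagonal.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

lam_ind = loglinear_coefficients(independent)
lam_dep = loglinear_coefficients(dependent)
print(lam_ind[(1, 1)], lam_dep[(1, 1)])
```

The first printed coefficient is zero up to rounding error, while the second equals one quarter of the log cross-product ratio of the dependent table.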

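Now $X' = TX$, where

$$T = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix};$$

note that $T^{-1} = T$ in this residue arithmetic, and also that
$(T^{-1})^t = T^t$. Thus

$$\begin{aligned}
p_i' &= \mathrm{pr}(X' = i) \\
     &= \mathrm{pr}(TX = i) \\
     &= \mathrm{pr}(X = T^{-1} i) \\
     &= p_{T^{-1} i} \\
     &= \exp\Big\{ \sum_{j \in I_2} \lambda_j (-1)^{(T^{-1} i)^t j} \Big\} \\
     &= \exp\Big\{ \sum_{j \in I_2} \lambda_j (-1)^{i^t (T^{-1})^t j} \Big\} \\
     &= \exp\Big\{ \sum_{k \in I_2} \lambda_{T^t k} (-1)^{i^t k} \Big\}.
\end{aligned} \tag{2.10}$$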


Thus the loglinear model for the probabilities $p_i'$, $i \in I_2$, has the
same coefficients as that for $p_i$, $i \in I_2$, except that they have been
permuted according to the transposed linear operator $T^t$, in the sense that
$\lambda_k' = \lambda_{T^t k}$. Now the condition for independence of $X_1'$
and $X_2'$ is $\lambda_{11}' = 0$, or $\lambda_{01} = 0$, and similarly
$X_1''$ and $X_2''$ are independent if $\lambda_{10} = 0$. Thus one may tell
from the coefficients in the model for the original variables whether either
transformation will give rise to independent variables.

The same argument extends readily to $d > 2$. The expansion (2.8)
for the log probabilities is still valid provided $I_2$ is replaced by $I_d$.
A linear transformation can be written as $X' = TX$, where $T$ is a matrix
of zeroes and ones, having an inverse in residue arithmetic modulo 2;
it is not in general true that $T^{-1} = T$. The sequence of manipulations
(2.10) does not depend on d, and hence it is true for $d > 2$ that the
coefficients are permuted according to $T^t$, that is
$\lambda_k' = \lambda_{T^t k}$, $k \in I_d$. The possible gains from making
such a transformation are discussed in the next section.
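The permutation property can be confirmed numerically for any invertible $T$. The following Python sketch draws an arbitrary positive distribution on a $2^3$ table, relabels the cells by a matrix $T$, and compares the coefficients of the transformed table with the permuted coefficients of the original. The matrix, the random distribution and all function names are hypothetical choices made for illustration only.

```python
import math
import random
from itertools import product

def dot2(u, v):
    """Inner product of two 0/1 vectors in arithmetic modulo 2."""
    return sum(a * b for a, b in zip(u, v)) % 2

def apply_mod2(T, x):
    """Matrix-vector product T x in arithmetic modulo 2."""
    return tuple(dot2(row, x) for row in T)

def coefficients(p):
    """lambda_j = 2^{-d} * sum_i log(p_i) * (-1)^{i.j} over all cells i."""
    return {j: sum((-1) ** dot2(i, j) * math.log(p[i]) for i in p) / len(p)
            for j in p}

d = 3
T = [[1, 1, 0], [0, 1, 0], [0, 1, 1]]          # hypothetical invertible matrix
Tt = [list(col) for col in zip(*T)]            # transpose of T
cells = list(product((0, 1), repeat=d))

random.seed(0)
weights = [random.random() + 0.1 for _ in cells]
p = {i: w / sum(weights) for i, w in zip(cells, weights)}
p_new = {apply_mod2(T, i): p[i] for i in cells}  # pr(X' = T i) = pr(X = i)

lam, lam_new = coefficients(p), coefficients(p_new)
max_gap = max(abs(lam_new[k] - lam[apply_mod2(Tt, k)]) for k in cells)
print(max_gap)
```

The printed discrepancy is at the level of rounding error, so the coefficients of the transformed table may be read off from those of the original without refitting.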


A different generalisation is to tables in which each variable may
take more than two values. The most obvious generalisation is to tables in
which each variable takes r values, $0, \ldots, r-1$. The natural arithmetic
here is residue arithmetic modulo r, and the natural decomposition is the
finite Fourier transform. Specifically, if $w = \exp(2\pi \mathrm{i}/r)$,
then $\mu_i = \log p_i = \log\{\mathrm{pr}(X = i)\}$ may be written as

$$\mu_i = \sum_{j \in I_{r,d}} \lambda_j w^{i^t j}, \qquad i \in I_{r,d}, \tag{2.11}$$

where $I_{r,d}$ is the set of all d-tuples $(i_1, \ldots, i_d)$ where
$0 \le i_a < r$, $a = 1, \ldots, d$. The inverse formula, defining the
$\lambda$'s, is

$$\lambda_j = r^{-d} \sum_{i \in I_{r,d}} \mu_i w^{-i^t j}. \tag{2.12}$$

Now suppose that $T$ is a $(d \times d)$ matrix whose entries $t_{ij}$ are
integers satisfying $0 \le t_{ij} < r$, $i, j = 1, \ldots, d$, and that there
exists an inverse $T^{-1}$ in residue arithmetic modulo r. Let $X' = TX$ in
this arithmetic. Then

$$\mathrm{pr}(X' = i) = \mathrm{pr}(X = T^{-1} i) = \exp\Big\{ \sum_{j \in I_{r,d}} \lambda_j w^{(T^{-1} i)^t j} \Big\},$$

which simplifies to

$$\exp\Big\{ \sum_{k \in I_{r,d}} \lambda_{T^t k}\, w^{i^t k} \Big\}.$$

Thus as in the $2^d$ case, the effect of this transformation is simply to
permute the coefficients according to $T^t$, that is
$\lambda_k' = \lambda_{T^t k}$.

This generalisation is, however, rather restrictive in its structure.
The Fourier decomposition is most suitable when the categories are ordered,
but is essentially invariant under cyclic permutations of these categories.
The type of data for which this seems natural would be where the categories
could meaningfully be arranged in a circle, an unusual situation.
Henceforth we shall only consider binary, that is dichotomous, variables.

It should be noted that the property that the model is transformed in
a simple way when the variables are transformed linearly (in the present
sense) is not restricted to the loglinear model described above. Clearly
the same argument applies if any function of the cell probabilities is
expanded in a factorial form, such as in the representation proposed by
Bahadur (1961).


3.  Motivation,  and  an  example 


The only reason mentioned as yet for transforming the variables by
which a $2^d$ contingency table is classified has been to attain
independence. For $d > 2$, one will rarely be able to transform to d mutually
independent variables, but one might hope to find variables which could be
partitioned into $s < d$ mutually independent blocks. Another possibility is
that of conditional independence. Each of these distributions may be
described in terms of restricted loglinear models.

The general version of (2.8) is

$$\log\{\mathrm{pr}(X = i)\} = \sum_{j \in I_d} \lambda_j (-1)^{i^t j}, \qquad i \in I_d. \tag{3.1}$$

In a restricted loglinear model, the summation is restricted to some subset
$J \subset I_d$. Thus effectively we constrain those $\lambda$'s whose
subscripts do not lie in $J$ to be zero. Since $\exp(\lambda_0)$ is merely a
normalising constant, we shall always suppose that $0 \in J$.

The models of block-wise independence and of conditional independence
correspond to choices of $J$ having certain specific structures; see
Goodman (1970). These structures all possess the property of being
hierarchical, which may be defined as follows. For $i, j \in I_d$, we write
$i \le j$ if $i_a \le j_a$, $a = 1, \ldots, d$. Then the set $J$ is
hierarchical if for every $j \in J$, $J$ contains every $i \le j$. It may be
seen from (3.1) that the parameter $\lambda_i$ describes the degree of
interaction of the variables $(X_a : i_a = 1)$. Thus a model is hierarchical
if whenever the interaction of a particular set of variables is included in
the model, the interactions of all subsets of those variables are also
included. The class of hierarchical models is evidently a natural one to
consider. We observe here that this notation is in conflict with that of
other authors, who have in general indexed interactions by the list of
$a$'s for which $j_a = 1$, that is, by a subset of $\{1, \ldots, d\}$. A
model then consists of a family of such subsets, and is hierarchical if the
family of subsets is hierarchical in the usual sense. However, our present
notation seems to be the most natural one to use in the present context.

One's aim in making a transformation of the kind discussed here would
be to find a hierarchical model which fits the transformed data, and
preferably a model which displays extra structure of the types mentioned
above.
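The definition of a hierarchical index set translates directly into a short check. The sketch below is in Python; the function name `is_hierarchical` and the two example sets are illustrative assumptions. A model is represented by its set $J$ of 0/1 index vectors.

```python
from itertools import product

def is_hierarchical(J):
    """True if, for every j in J, every i <= j (componentwise) is also in J."""
    J = set(J)
    for j in J:
        # i ranges over all vectors with i_a <= j_a for every component a.
        for i in product(*[(0, 1) if j_a else (0,) for j_a in j]):
            if i not in J:
                return False
    return True

# Two main effects and their interaction, with the constant term: hierarchical.
full_two_way = {(0, 0), (1, 0), (0, 1), (1, 1)}
# The interaction without either main effect: not hierarchical.
interaction_only = {(0, 0), (1, 1)}

print(is_hierarchical(full_two_way), is_hierarchical(interaction_only))
```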

As an example, we consider some data extracted from those of Ries and
Smith (1963), and shown in Table 1. In the original data, the quality of
water had a third possible value (medium); we have omitted this in order to
leave a $2^4$ table. As a first step we fitted the unrestricted model,
$J = I_4$; the values of the parameters are given in Table 2. This analysis
amounts merely to a factorial analysis of the log counts. These data are
rather unusual in that some two-variable interactions are smaller in
magnitude than some three-variable interactions. This is similar to a
feature detected in the complete data by Goodman (1971).

However, if we define new variables by $X_1' = X_1 + X_2$ modulo 2,
$X_a' = X_a$, $a = 2, 3, 4$, then the parameters are permuted as has been
described above. The permuted parameters are given in Table 3. In the
transformed table, the highest ranked three-variable interaction ranks 10th
(down from 11th), and the highest ranked two-variable interaction ranks
13th (down from 14th). These changes, while not dramatic in themselves,
are accompanied by others which also tend to reduce the weight of the higher
order interaction terms. One simple way to measure this is by the sums of
squares of the parameters, shown in Table 6. The larger parameter values
have been permuted into 'interactions' involving only one variable each,
that is into 'main effects'. In fact, the model in which these new
variables $X'$ are independent fits the data well ($X^2 = 13$ on 11 degrees
of freedom).
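The factorial analysis of the log counts, and the effect of the transformation on the sums of squares, can be sketched in a few lines of Python. The cell layout assumed below for Table 1 (columns indexed by $(X_1, X_2)$, rows by $(X_3, X_4)$) is a reading of the table rather than something stated explicitly in the text, and the function names are illustrative. The printed values may be compared with the constant term in Table 2 and the one-variable entries for variables $X$ and $X'$ in Table 6.

```python
import math
from itertools import product

# Counts read from Table 1, keyed by (i1, i2, i3, i4); layout is an assumption.
ROWS = {  # (i3, i4) -> counts for (i1, i2) = (0,0), (1,0), (0,1), (1,1)
    (0, 0): (52, 68, 52, 37),
    (1, 0): (30, 42, 43, 24),
    (0, 1): (53, 63, 49, 57),
    (1, 1): (27, 29, 29, 19),
}
COLS = ((0, 0), (1, 0), (0, 1), (1, 1))
counts = {c + r: v for r, vals in ROWS.items() for c, v in zip(COLS, vals)}
n = sum(counts.values())
CELLS = list(product((0, 1), repeat=4))

def dot2(u, v):
    """Inner product of two 0/1 vectors in arithmetic modulo 2."""
    return sum(a * b for a, b in zip(u, v)) % 2

# Saturated model: factorial analysis of the log proportions.
lam = {j: sum((-1) ** dot2(i, j) * math.log(counts[i] / n) for i in CELLS) / 16
       for j in CELLS}

def permute(lam, T):
    """Coefficients for X' = T X (mod 2): lambda'_k = lambda_{T^t k}."""
    Tt = list(zip(*T))
    return {k: lam[tuple(dot2(row, k) for row in Tt)] for k in lam}

def sums_of_squares(lam):
    """Sum of squared coefficients for each number of interacting variables."""
    out = {}
    for j, v in lam.items():
        if any(j):
            out[sum(j)] = out.get(sum(j), 0.0) + v * v
    return out

T1 = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # X1' = X1 + X2
ss_x = sums_of_squares(lam)
ss_x1 = sums_of_squares(permute(lam, T1))
print(round(lam[(0, 0, 0, 0)], 4), round(ss_x[1], 4), round(ss_x1[1], 4))
```

Because the transformation only permutes the coefficients, the total sum of squares over all interaction orders is the same before and after; only its allocation between orders changes.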

However we shall examine the data a little further, to see what else
may be accomplished. For example, one might wish to have the highest
ranked three-variable interaction rather smaller. If we define a second
transformation by $X_2'' = X_1 + X_2$, $X_a'' = X_a$, $a = 1, 3, 4$, then the
parameters are permuted as in Table 4. The model of complete independence is
now less tenable ($X^2 = 18$ on 11 degrees of freedom, between the upper 5%
and 10% points). However, the model in which all main effects and all
two-variable interactions are included, but no higher order terms, fits this
transformation rather better. The $X^2$ values, each with five degrees of
freedom, are 6.0 for variables $X$, 3.6 for variables $X'$ and 1.9
for variables $X''$. These values show that this model in fact fits the
original variables adequately. However, they also illustrate the
possibility of improving the fit by transformation; a similar three-fold
reduction of $X^2$ in other data could be quite dramatic.

Goodman (1971), examining the complete table, suggested that the
transformation $X_1^* = X_1 + X_2$, $X_2^* = X_2 + X_3$, $X_3^* = X_3$,
$X_4^* = X_4$ would simplify the data. The resulting permuted parameters are
given in Table 5. The corresponding row of Table 6 suggests that the model
with no three- or four-variable interactions should fit slightly better than
for variables $X''$. This is borne out by a $X^2$ value of around 1.5
(on 5 degrees of freedom).


4. Estimation

The  accepted  procedure  for  estimating  the  parameters  in  a  restricted 
loglinear  model  is  that  of  maximum  likelihood.  Furthermore,  it  has  been 
shown  that  when  the  model  is  hierarchical,  this  fitting  may  be  performed  by 
an algorithm known variously as iterative scaling and the Deming-Stephan
algorithm; see for example Ireland and Kullback (1968). Thus one way to
search  for  a  suitable  transformation  to  be  applied  to  the  data  would  be 
to  perform  various  transformations,  and  then  to  see  whether  the  transformed 
data  are  adequately  fitted  by  some  hierarchical  model,  as  fitted  by  iterative 
scaling. 

Fortunately  there  is  a  simpler  procedure  available.  For  as  we  have 
shown  above,  the  loglinear  model  transforms  in  a  simple  way  when  the  data 
are  transformed  linearly.  Thus  any  model  for  the  transformed  data 
corresponds  to  a  model  for  the  original  data,  and  hence  may  be  fitted  without 
performing  the  transformation.  An  example  of  this  is  given  in  this  Section. 

However,  a  hierarchical  model  for  the  transformed  data  will  not  in 
general  correspond  to  a  hierarchical  model  for  the  original  data.  Thus 
we  must  consider  the  maximum  likelihood  fitting  of  non-hierarchical  models. 
Fortunately again, this may be achieved by using iterative scaling, although
not in the usual form.
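The loss of the hierarchical property is easily seen in a small case. In the Python sketch below, a model $J'$ for the transformed variables $X' = TX$ is mapped to the corresponding index set $\{T^t k : k \in J'\}$ for the original variables, using the permutation rule of Section 2. The function names are illustrative; the matrix is the one for $X_1' = X_1 + X_2$, and $J'$ is the model of complete independence.

```python
from itertools import product

def dot2(u, v):
    """Inner product of two 0/1 vectors in arithmetic modulo 2."""
    return sum(a * b for a, b in zip(u, v)) % 2

def is_hierarchical(J):
    """True if, for every j in J, every i <= j (componentwise) is also in J."""
    J = set(J)
    return all(i in J
               for j in J
               for i in product(*[(0, 1) if j_a else (0,) for j_a in j]))

def original_index_set(J_new, T):
    """Map a model J' for X' = T X (mod 2) to the set {T^t k : k in J'}."""
    Tt = list(zip(*T))
    return {tuple(dot2(row, k) for row in Tt) for k in J_new}

T = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # X1' = X1 + X2
J_new = {(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)}
J = original_index_set(J_new, T)

print(sorted(J))
print(is_hierarchical(J_new), is_hierarchical(J))
```

In terms of the original variables the model contains the interaction of $X_1$ and $X_2$ but not the main effect of $X_1$, and so it is not hierarchical.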
The logarithm of the likelihood function for the parameters
$(\lambda_j, j \in J)$ of a restricted model, given observed data
$(n_i, i \in I_d)$, is

$$\log(n!) - \sum_{i \in I_d} \log(n_i!) + \sum_{i \in I_d} n_i \log(p_i), \tag{4.1}$$

where $n = \sum_i n_i$, and

$$\log(p_i) = \sum_{j \in J} \lambda_j (-1)^{i^t j}. \tag{4.2}$$

The part of this which depends on the parameters simplifies to

$$\sum_{j \in J} \sum_{i \in I_d} n_i \lambda_j (-1)^{i^t j} = \sum_{j \in J} \lambda_j a_j, \tag{4.3}$$

say, so that $a_j = \sum_{i \in I_d} n_i (-1)^{i^t j}$. The parameters are of
course subject to the constraint $\sum_i p_i = 1$. Thus (4.3) may be
maximised by the method of the undetermined multiplier. Let

$$S(\lambda_j, j \in J) = S(\lambda) = \sum_{j \in J} \lambda_j a_j - \theta \sum_{i \in I_d} p_i.$$

Then

$$0 = \frac{\partial S}{\partial \lambda_k} = a_k - \theta \sum_{i \in I_d} (-1)^{i^t k} \exp\Big\{ \sum_{j \in J} \lambda_j (-1)^{i^t j} \Big\}, \qquad k \in J,$$

are the equations to be solved. Since $0 \in J$, the first of these is

$$a_0 = \theta \sum_{i \in I_d} \exp\Big\{ \sum_{j \in J} \lambda_j (-1)^{i^t j} \Big\}.$$

But $a_0 = n$, the total number of observations, and the sum on the right
hand side is constrained to equal 1. Thus $\theta = n$, and the remaining
equations to be solved are

$$\sum_{i \in I_d} (-1)^{i^t k} \exp\Big\{ \sum_{j \in J} \lambda_j (-1)^{i^t j} \Big\} = a_k/n, \qquad k \in J,$$

or

$$\sum_{i \in I_d} (-1)^{i^t k} p_i = a_k/n, \qquad k \in J. \tag{4.4}$$

A complementary set of equations, implied by $\lambda_j = 0$, $j \notin J$,
are

$$\sum_{i \in I_d} (-1)^{i^t j} \log p_i = 0, \qquad j \notin J.$$

Let $P_k$ be the set of $i$ for which $(-1)^{i^t k} = 1$, and $M_k$ be the
complement in $I_d$ of $P_k$. Then (4.4) may be written

$$\sum_{i \in P_k} p_i - \sum_{i \in M_k} p_i = \frac{1}{n} \sum_{i \in P_k} n_i - \frac{1}{n} \sum_{i \in M_k} n_i,$$

where we have rewritten $a_k$ on the right hand side of the equation in the
same way as the left hand side has been rewritten. But since the sum of the
two terms on each side of the equation equals one, it implies that

$$\sum_{i \in P_k} p_i = \frac{1}{n} \sum_{i \in P_k} n_i, \qquad \sum_{i \in M_k} p_i = \frac{1}{n} \sum_{i \in M_k} n_i. \tag{4.5}$$


These equations have an especially simple interpretation when $k$ has
only one entry equal to 1. Suppose $k_a = 1$, $k_b = 0$, $b \ne a$. Then
$P_k = \{i \in I_d : i_a = 0\}$, and hence

$$\sum_{i \in P_k} p_i \tag{4.6}$$

is the probability that $X_a = 0$. Thus the equations then state that in the
fitted distribution, the probability that $X_a = 0$ should equal the observed
proportion of times that this event occurred. If the two terms on the left
hand side of (4.5) are called a fitted one-variable marginal subtable,
then the fitted marginal subtable for $X_a$ has to coincide with the
observed marginal subtable.

When $k$ is not of this form, the interpretation is not so clear.
However, if the model is hierarchical, then it has been shown that the
equations may be grouped, usually not into disjoint sets, in such a way that
each group implies that a fitted subtable should coincide with an
observed subtable. However, these subtables are not in general
one-variable subtables, but describe the joint distribution of a number of
variables. Furthermore, there exists a minimal set of such marginal
subtables.

In the present context, a different interpretation is more suitable.
Define a transformed variable by $Y = k^t X$ modulo 2. Then (4.6) is the
probability that $Y = 0$, and equations (4.5) state that the fitted
marginal distribution of $Y$ should equal its observed marginal distribution.
When only the a'th entry of $k$ takes the value 1, $Y = X_a$, and hence
this observation is in agreement with our earlier statement. Thus each
equation in (4.4) may be interpreted as forcing a fitted one-variable
subtable to coincide with its observed counterpart, except that the
distribution described by the table may be that of a transformed variable.
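The link between the contrast $a_k$ and the margin of the transformed variable can be made concrete. The Python sketch below uses a hypothetical 2x2 table of counts and illustrative function names. It computes $a_k$ and the observed proportion of times that $Y = k^t X$ (modulo 2) equals 0, and confirms that the latter is $(1 + a_k/n)/2$.

```python
def dot2(u, v):
    """Inner product of two 0/1 vectors in arithmetic modulo 2."""
    return sum(a * b for a, b in zip(u, v)) % 2

def contrast(counts, k):
    """a_k = sum over cells i of n_i * (-1)^{i.k}."""
    return sum(n_i * (-1) ** dot2(i, k) for i, n_i in counts.items())

def proportion_y_zero(counts, k):
    """Observed proportion of observations with Y = k.X (mod 2) equal to 0."""
    n = sum(counts.values())
    return sum(n_i for i, n_i in counts.items() if dot2(i, k) == 0) / n

# Hypothetical counts for a 2x2 table, keyed by (i1, i2).
counts = {(0, 0): 30, (0, 1): 10, (1, 0): 20, (1, 1): 40}
k = (1, 1)                      # Y = X1 + X2 modulo 2
n = sum(counts.values())
a_k = contrast(counts, k)
prop = proportion_y_zero(counts, k)
print(a_k, prop, (1 + a_k / n) / 2)
```

Matching $a_k/n$ in the fitted distribution is therefore the same as matching the two-cell margin of $Y$.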

The Deming-Stephan algorithm for solving these equations is an
iterative procedure. One begins with a probability distribution belonging
to the model being fitted; since the distribution which attaches probability
$2^{-d}$ to each outcome belongs to any model, this is usually chosen as the
starting point. Each step of the iteration consists of a number of
substeps, one for each marginal subtable which is constrained. The substep
corresponding to a particular subtable consists of rescaling the
distribution so as to satisfy the constraint, each of the probabilities
which are summed to give one element of the subtable being rescaled by the
same amount.

In fitting a hierarchical model, this procedure is applied to the
minimal set of marginal subtables. For nonhierarchical models, however,
this cannot be done. Since the nonhierarchical models we are interested
in are hierarchical in terms of some transformed variables, one solution
to this problem would be to transform the variables, that is permute the
table, and then fit the corresponding hierarchical model. An alternative
solution, in terms of the original variables, is to use the set of
one-variable marginal subtables described above; this was the procedure used
by the author. One disadvantage of this procedure is that it may fail to
converge in a finite number of iterations when the more sophisticated
version would, and in general it converges more slowly. To compensate for
this, a modified version was used, in which instead of cycling through the
set of constraints to be fitted, the one which is most seriously being
violated is found, and then the fitted table is forced to fit it. It is
not known how this procedure compares with the usual one when the model
being fitted is, in fact, hierarchical.
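A minimal sketch of scaling on one-variable subtables of transformed variables is given below in Python. It is not the author's program: it cycles through the constraints in a fixed order rather than selecting the most seriously violated one, and the function names, tolerance and sweep limit are illustrative assumptions. The counts are read from Table 1 under an assumed cell layout, and $J$ is the index set, in terms of the original variables, of the model in which $X_1 + X_2$, $X_2$, $X_3$ and $X_4$ are independent.

```python
import math

def dot2(u, v):
    """Inner product of two 0/1 vectors in arithmetic modulo 2."""
    return sum(a * b for a, b in zip(u, v)) % 2

def fit_by_scaling(counts, J, max_sweeps=500, tol=1e-10):
    """Cyclic iterative scaling on the margins of Y = k.X (mod 2), k in J."""
    cells = list(counts)
    n = sum(counts.values())
    p = {i: 1.0 / len(cells) for i in cells}    # uniform start lies in any model
    constraints = [k for k in J if any(k)]      # lambda_0 only normalises
    for _ in range(max_sweeps):
        worst = 0.0
        for k in constraints:
            obs0 = sum(counts[i] for i in cells if dot2(i, k) == 0) / n
            fit0 = sum(p[i] for i in cells if dot2(i, k) == 0)
            worst = max(worst, abs(obs0 - fit0))
            for i in cells:                     # one factor per level of Y
                p[i] *= obs0 / fit0 if dot2(i, k) == 0 else (1 - obs0) / (1 - fit0)
        if worst < tol:
            break
    return p

# Counts read from Table 1, keyed by (i1, i2, i3, i4); layout is an assumption.
ROWS = {(0, 0): (52, 68, 52, 37), (1, 0): (30, 42, 43, 24),
        (0, 1): (53, 63, 49, 57), (1, 1): (27, 29, 29, 19)}
COLS = ((0, 0), (1, 0), (0, 1), (1, 1))
counts = {c + r: v for r, vals in ROWS.items() for c, v in zip(COLS, vals)}
n = sum(counts.values())

J = {(0, 0, 0, 0), (1, 1, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)}
p = fit_by_scaling(counts, J)
x2 = 2 * sum(c * math.log(c / (n * p[i])) for i, c in counts.items())
print(round(x2, 2))
```

Each rescaling multiplies the probabilities by one factor where $Y = 0$ and another where $Y = 1$, which alters only $\lambda_0$ and $\lambda_k$, so the fitted distribution never leaves the model. The printed statistic may be compared with the value quoted in Section 3 for the independence model in the variables $X'$.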

The $X^2$ goodness of fit statistics referred to in the previous
Section are calculated as minus twice the logarithm of the likelihood
ratio, testing the model being fitted against the saturated model, in
which all parameters are allowed to be nonzero. That is,

$$\tfrac{1}{2} X^2 = \sum_{i \in I_d} n_i \log n_i - n \log n - \sum_{i \in I_d} n_i \log p_i,$$

where $n = \sum_{i \in I_d} n_i$ is the total number of observations, and
$p_i$, $i \in I_d$, are the fitted probabilities under the model being
tested. The last sum may be rewritten

$$\sum_{i \in I_d} n_i \log p_i = \sum_{i \in I_d} n_i \sum_{j \in J} \lambda_j (-1)^{i^t j} = \sum_{j \in J} \lambda_j \sum_{i \in I_d} n_i (-1)^{i^t j} = \sum_{j \in J} \lambda_j a_j$$

in the notation of (4.3). Thus $X^2$ may be calculated from the data
$n_i$, $i \in I_d$, and the fitted values of the parameters $\lambda_j$,
$j \in J$. In the computer program used to fit the models discussed in the
previous Section, the change in the value of $X^2$ was used both to select
the parameter to be modified, and as a criterion for terminating the
iteration.
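The identity used to rewrite the last sum can be checked on a small example. In the Python sketch below the counts and fitted probabilities are hypothetical and the function names are illustrative. It evaluates $\sum_i n_i \log p_i$ directly and through the parameters and the contrasts $a_j$, and also confirms that the statistic vanishes when the fitted probabilities equal the observed proportions.

```python
import math

def dot2(u, v):
    """Inner product of two 0/1 vectors in arithmetic modulo 2."""
    return sum(a * b for a, b in zip(u, v)) % 2

def half_x2(counts, p):
    """Half the likelihood ratio statistic against the saturated model."""
    n = sum(counts.values())
    return (sum(c * math.log(c) for c in counts.values())
            - n * math.log(n)
            - sum(c * math.log(p[i]) for i, c in counts.items()))

# Hypothetical counts and fitted probabilities for a 2x2 table.
counts = {(0, 0): 30, (0, 1): 10, (1, 0): 20, (1, 1): 40}
p = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.4}
cells = list(counts)

lam = {j: sum((-1) ** dot2(i, j) * math.log(p[i]) for i in cells) / len(cells)
       for j in cells}
a = {j: sum(counts[i] * (-1) ** dot2(i, j) for i in cells) for j in cells}

direct = sum(counts[i] * math.log(p[i]) for i in cells)
via_parameters = sum(lam[j] * a[j] for j in cells)
observed = {i: c / sum(counts.values()) for i, c in counts.items()}
print(direct, via_parameters, 2 * half_x2(counts, p), half_x2(counts, observed))
```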


Acknowledgements 

The  stimulus  for  this  work  was  a  lecture  given  by  Professor  D.  R.  Cox 
in  the  Spring  of  1971.  Gary  Simon's  comments  have  been  invaluable  during 
the subsequent development. This research was partially supported by
the Office of Naval Research, under contract No. 0014-67-A-0151-0017,
awarded to the Department of Statistics, Princeton University.


References 

Bahadur, R. R. (1961). A representation of the joint distribution of
responses to n dichotomous items. In Studies in Item Analysis
and Prediction, Solomon, H. (Ed), 158-168. Stanford University
Press, Stanford, California.

Birch, M. W. (1963). Maximum likelihood in three-way contingency
tables. J. Roy. Statist. Soc. B 25, 220-233.

Bishop, Y. M. M. (1969). Full contingency tables, logits, and
split contingency tables. Biometrics 25, 383-399.

Cox, D. R., and Lauh, E. (1967). A note on the graphical analysis
of multidimensional contingency tables. Technometrics 9, 481-488.

Deming, W. E., and Stephan, F. F. (1940). On a least squares
adjustment of a sampled frequency table when the expected marginal
totals are known. Ann. Math. Statist. 11, 427-444.

Goodman, L. A. (1971). The analysis of multidimensional
contingency tables: stepwise procedures and direct
estimation methods for building models for multiple
classifications. Technometrics 13, 33-63.

Ireland, C. T., and Kullback, S. (1968). Contingency tables
with given marginals. Biometrika 55, 179-188.

Ries, P. N., and Smith, H. (1963). The use of chi-square for
preference testing in multidimensional problems. Chem.
Eng. Progress Symposium Series No. 42, 39-43.


Table 1

                          X_2 = 0               X_2 = 1
                     X_1 = 0   X_1 = 1     X_1 = 0   X_1 = 1

X_4 = 0   X_3 = 0       52        68          52        37
          X_3 = 1       30        42          43        24

X_4 = 1   X_3 = 0       53        63          49        57
          X_3 = 1       27        29          29        19

X_1 = 0: preferred brand M;        X_1 = 1: preferred brand X
X_2 = 0: previous non-user of M;   X_2 = 1: previous user of M
X_3 = 0: low temperature;          X_3 = 1: high temperature
X_4 = 0: hard water;               X_4 = 1: soft water


Table 2

$\lambda_{0000}$ = -2.8361 (16)
$\lambda_{0010}$ =  0.2955 (15)
$\lambda_{0001}$ =  0.0492 (9)
$\lambda_{0011}$ = -0.0887 (13)


Table 3

Values of $\lambda'_i = \lambda'_{i_1, i_2, i_3, i_4}$, with rank
(in terms of absolute value) in parentheses

$\lambda'_{1000}$ = -0.1278 (14)
$\lambda'_{0100}$ =  0.0836 (12)
$\lambda'_{1100}$ =  0.0216 (5)


Table 4

Values of $\lambda''_i = \lambda''_{i_1, i_2, i_3, i_4}$, with rank
(in terms of absolute magnitude) in parentheses


Table 5

Values of $\lambda^*_i = \lambda^*_{i_1, i_2, i_3, i_4}$, with rank
(in terms of absolute magnitude) in parentheses

(i_3, i_4)    (i_1, i_2) = (0,0)      (1,0)           (0,1)           (1,1)

(0,0)           -2.8361 (16)      -0.1278 (14)     0.0148 (2)     -0.0531 (10)
(1,0)            0.2955 (15)       0.0490 (8)      0.0836 (12)     0.0216 (5)
(0,1)            0.0492 (9)       -0.0633 (11)     0.0364 (7)      0.0313 (6)
0.0313  (6) 
Table 6

Sums of squares of fitted parameters,
grouped by level of interaction

Number of variables in interaction      1        2        3        4
Number of such interactions             4        6        4        1

variables X                           0.0972   0.0279   0.0087   0.0001
variables X'                          0.1131   0.0153   0.0046   0.0010
variables X''                         0.1065   0.0244   0.0016   0.0013
variables X*                          0.1063   0.0254   0.0019   0.0003

